In this notebook, we will conduct an exploratory data analysis and linear regression in R using the Walmart sales dataset from this Kaggle link.
First, let’s load the libraries needed for data analysis. Here we will use the pacman package for managing add-on packages: if a package is already installed, it will simply be loaded; otherwise it will be downloaded and then loaded.
#use require() or library() to load the base packages
require(pacman) # gives a confirmation message
Loading required package: pacman
library(pacman) # load the package, but no confirmation message
# We can load all these commonly used packages at once
pacman::p_load(pacman, dplyr, GGally, ggplot2, ggthemes,
ggvis, httr, lubridate, plotly, rio, rmarkdown, shiny,
stringr, tidyr)
# you can also install packages individually via install.packages("package_name")
Now let’s read in the Walmart dataset and conduct some exploratory data analysis and visualization. We will use the import function from the rio library, which can import files such as csv, xlsx, and txt. Otherwise we would need format-specific functions such as read.csv and read.table.
data1 <- import('../../datasets/Walmart.csv') # specify the path location
# Alternatively, we could use read.csv(filepath, header = TRUE)
#data1 = read.csv('../../datasets/Walmart.csv', header = TRUE)
# display the first 20 entries of the data
head(data1,20)
# dimension of the dataset
dim(data1)
[1] 6435 8
The dataset has 6435 rows and 8 columns, which correspond to the following attributes:
Store - the store number
Date - the week of sales
Weekly_Sales - sales for the given store
Holiday_Flag - whether the week is a special holiday week 1 – Holiday week 0 – Non-holiday week
Temperature - Temperature on the day of sale
Fuel_Price - Cost of fuel in the region
CPI – Prevailing consumer price index
Unemployment - Prevailing unemployment rate
Holiday Events
Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
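As a quick sanity check on these dates (a base-R sketch; the sample dates are copied from the list above), we can parse the dd-mm-yyyy strings and see which weekday each holiday event falls on:

```r
# Parse dd-mm-yyyy holiday dates (samples from the list above) with base R
holidays <- c("12-02-2010", "26-11-2010", "31-12-2010")
parsed <- as.Date(holidays, format = "%d-%m-%Y")
format(parsed, "%u")   # ISO weekday number: 1 = Monday ... 7 = Sunday
```

All three return "5" (Friday), which suggests the weeks in this dataset are aligned to Fridays.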
```r
summary(data1)
```
```
Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price
Min. : 1 Length:6435 Min. : 209986 Min. :0.00000 Min. : -2.06 Min. :2.472
1st Qu.:12 Class :character 1st Qu.: 553350 1st Qu.:0.00000 1st Qu.: 47.46 1st Qu.:2.933
Median :23 Mode :character Median : 960746 Median :0.00000 Median : 62.67 Median :3.445
Mean :23 Mean :1046965 Mean :0.06993 Mean : 60.66 Mean :3.359
3rd Qu.:34 3rd Qu.:1420159 3rd Qu.:0.00000 3rd Qu.: 74.94 3rd Qu.:3.735
Max. :45 Max. :3818686 Max. :1.00000 Max. :100.14 Max. :4.468
CPI Unemployment
Min. :126.1 Min. : 3.879
1st Qu.:131.7 1st Qu.: 6.891
Median :182.6 Median : 7.874
Mean :171.6 Mean : 7.999
3rd Qu.:212.7 3rd Qu.: 8.622
Max. :227.2 Max. :14.313
```
Since that output is a little cluttered, let’s take a look at just the weekly sales column.
summary(data1$Weekly_Sales)
Min. 1st Qu. Median Mean 3rd Qu. Max.
209986 553350 960746 1046965 1420159 3818686
# Let's get the unique store values
unique(data1$Store)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
[38] 38 39 40 41 42 43 44 45
So there are 45 Walmart stores in this dataset. We can aggregate the data by store number and sum the weekly sales to see whether certain stores have more sales than others. To do this, we will use the group_by function from the dplyr library. Let’s group the data by store number and store the sum of weekly sales in another data frame, gdf.
‘%>%’ is the pipe operator, used to chain functions together in R.
gdf <- data1 %>% group_by(Store) %>%
summarise(Total_sales = sum(Weekly_Sales))
gdf
# plot the sales as a function of store number
plot(gdf, col = 'blue', type = 'h', pch = 19, main = "Total Sales", xlab = "Store Number", ylab= "Sales")
As we can see, some stores have higher cumulative sales than others, which could be a regional factor as well. Now let’s see how the sales change as a function of date for a single store, e.g. store 1. For this we will use the plot_ly function from the plotly library.
#plot the sales as a function of the date as well
fig <- plot_ly(data1, type = 'scatter', mode = 'markers')%>%
add_trace(x = data1$Date[data1$Store == 1], y = data1$Weekly_Sales[data1$Store == 1])%>%
layout(showlegend = F)
fig <- fig %>%
layout(
xaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
yaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
plot_bgcolor='#e5ecf6', width = 900)
Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
fig
Warning: Can't display both discrete & non-discrete data on same axis
Interestingly, there is a spike in the weekly sales between Thanksgiving and Christmas in both 2010 and 2011. Let’s plot the same for all stores.
#plot the sales as a function of the date as well
fig <- plot_ly(data1, type = 'scatter', mode = 'markers')%>%
add_trace(x = data1$Date, y = data1$Weekly_Sales)%>%
layout(showlegend = F)
fig <- fig %>%
layout(
xaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
yaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
plot_bgcolor='#e5ecf6', width = 900)
Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
fig
Warning: Can't display both discrete & non-discrete data on same axis
If we look at the holiday events,
Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
We can clearly see an increase in sales during the holiday season, and it always peaks between Thanksgiving and Christmas.
Let’s make a scatter plot of Weekly sales and temperature.
plot( data1$Temperature, data1$Weekly_Sales, col = 'blue', main = 'Sales wrt Temp', ylab = "Weekly Sales", xlab = "Temperature")
Weekly sales and temperature do not seem to be correlated with each other. Let’s do some more plotting in a subplot format to look for correlations using the plotly library.
#Initialize figures
fig1 <- plot_ly(x = data1$Holiday_Flag, y = data1$Weekly_Sales, type = 'scatter', name = 'holiday', mode = 'markers') %>%
layout(xaxis = list(title = 'Holiday Flag'), yaxis = list(title = 'Weekly Sales'))
fig2 <- plot_ly(x = data1$Fuel_Price, y = data1$Weekly_Sales, type = 'scatter', name = 'Fuel', mode = 'markers') %>%
layout(xaxis = list(title = 'Fuel Price'), yaxis = list(title = 'Weekly Sales'))
fig3 <- plot_ly(x = data1$CPI, y = data1$Weekly_Sales, type = 'scatter', name = 'CPI', mode = 'markers') %>%
layout(xaxis = list(title = 'CPI'), yaxis = list(title = 'Weekly Sales'))
fig4 <- plot_ly(x = data1$Unemployment, y = data1$Weekly_Sales, type = 'scatter', name = 'Unemployment', mode = 'markers') %>%
layout(xaxis = list(title = 'Unemployment'), yaxis = list(title = 'Weekly Sales'))
#creating subplot
fig <- subplot(fig1, fig2, fig3, fig4, nrows = 2, titleY = TRUE, titleX = TRUE, margin = 0.1 )
fig <- fig %>%layout(title = 'Weekly Sales wrt Different Factors',
plot_bgcolor='#e5ecf6',
xaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
yaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'), autosize = F, width = 900, height = 500)
Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
fig
As we can see, the weekly sales are not directly correlated with the holiday flag, fuel price, or CPI. The weekly sales go down as the unemployment rate goes up.
From our initial exploratory data analysis, we can see that in this dataset the weekly sales mainly depend on the holiday period and on the geographical location/store number. Sales are also better when the unemployment rate is lower.
Let’s see if the data has any missing (NA) or NaN values before modeling. We will use the filter function to find missing/NaN values and mutate to replace the bad values.
data1 %>%
summarise(count = sum(is.na(data1)))
This data was taken from Kaggle and does not contain any NA/NaN values, but we can introduce some and then clean the dataset ourselves.
data1[5,5] <- NA
data1[9,5] <- NaN
head(data1, 10)
Now let’s check again for NA/NaN values. is.na checks for both NA and NaN values, while is.nan only checks for NaN values.
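A tiny base-R illustration of the difference on a toy vector:

```r
v <- c(1.5, NA, NaN, 4.2)
is.na(v)    # TRUE for both NA and NaN -> FALSE TRUE TRUE FALSE
is.nan(v)   # TRUE only for NaN        -> FALSE FALSE TRUE FALSE
```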
data1 %>%
summarise(count = sum(is.na(data1)))
# is.nan works on a vector, not on a whole data frame
data1 %>%
summarise(count = sum(is.nan(data1$Temperature)))
Let’s replace the NA/NaNs with the median values in the data set.
data1 <- data1 %>%
mutate(Temperature = replace(Temperature, is.na(Temperature), median(Temperature, na.rm = TRUE)))
head(data1, 10)
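The same median imputation can be generalized to every numeric column. Here is a base-R sketch on a small hypothetical data frame (the column names mimic data1, but the values are made up):

```r
# Hypothetical toy data frame with NA/NaN holes
df <- data.frame(Temperature = c(42.3, NA, 55.1, NaN, 60.0),
                 Fuel_Price  = c(2.57, 2.55, NA, 2.62, 2.67))

# Replace NA/NaN in each numeric column with that column's median
for (col in names(df)) {
  if (is.numeric(df[[col]])) {
    med <- median(df[[col]], na.rm = TRUE)   # is.na() below also catches NaN
    df[[col]][is.na(df[[col]])] <- med
  }
}
df   # no NA/NaN values remain
```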
Before modeling the data, we need to convert the dates into more meaningful numbers. In our case, rather than mapping the days to arbitrary numbers, we want a cyclic variable running from 1 to 365, since sales are a function of the time of year, especially the holiday period. Let’s write a function to do that.
#defining a function to convert the dates into the day of the year
date_to_number <- function(dates){
num_date <- c()
for (i in seq_along(dates)){
date <- dates[i]
d <- strtoi(substr(date, 1, 2), 10) # extract the day substring and convert to a base-10 integer
m <- strtoi(substr(date, 4, 5), 10) # month
num_date <- append(num_date, m*30 + d) # approximate day of the year, treating each month as 30 days
}
return(num_date)
}
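Note that m*30 + d is only an approximation, since it treats every month as 30 days long. A more exact base-R alternative (a sketch, assuming the same dd-mm-yyyy string format) uses as.Date with the "%j" (day-of-year) format code:

```r
# Exact day-of-year conversion using base R date handling
date_to_doy <- function(dates) {
  as.integer(format(as.Date(dates, format = "%d-%m-%Y"), "%j"))
}

date_to_doy(c("05-02-2010", "26-11-2010"))   # true day-of-year values: 36, 330
```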
new_dates <- date_to_number(data1$Date)
#print(new_dates)
# add the new date numbers to the dataframe
data1 <- data1 %>%
mutate(date_number = new_dates)
head(data1,6)
# Now let's make a plot using ggplot of the sales as a function of the new date numbers we created
ggplot(data = data1, mapping = aes(y = Weekly_Sales, x = date_number, color = Holiday_Flag)) + geom_point() + labs(title = "Weekly sales wrt day in a year", x = "Nth day in a year", y = "Weekly sales")
One interesting thing to note here is that some of the high-sales weeks between Thanksgiving and Christmas are not marked with the holiday flag, which might affect the modeling of the data.
Let’s build a correlation matrix first using the Pearson correlation coefficient.
data_new <- data1[-2] # removing the dates column
head(data_new)
#use the cor function to get the correlation of features in the data frame
res <- cor(data_new)
round(res,2)
Store Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI Unemployment date_number
Store 1.00 -0.34 0.00 -0.02 0.06 -0.21 0.22 0.00
Weekly_Sales -0.34 1.00 0.04 -0.06 0.01 -0.07 -0.11 0.07
Holiday_Flag 0.00 0.04 1.00 -0.16 -0.08 0.00 0.01 0.13
Temperature -0.02 -0.06 -0.16 1.00 0.14 0.18 0.10 0.24
Fuel_Price 0.06 0.01 -0.08 0.14 1.00 -0.17 -0.03 -0.04
CPI -0.21 -0.07 0.00 0.18 -0.17 1.00 -0.30 0.01
Unemployment 0.22 -0.11 0.01 0.10 -0.03 -0.30 1.00 -0.01
date_number 0.00 0.07 0.13 0.24 -0.04 0.01 -0.01 1.00
# Let's import the corrplot library for the visualization of the correlation
library(corrplot)
corrplot 0.92 loaded
corrplot(res, type = "full", order = "hclust",
tl.col = "black", tl.srt = 45)
In this correlogram, the radius of each circle represents the correlation strength and the color represents whether the correlation is positive or negative. As we can see, the weekly sales have some correlation with the store number and are weakly or not at all correlated with the rest of the features.
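As a reminder of what these numbers are, here is the Pearson coefficient computed by hand on toy vectors, matching cor()'s default method:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Pearson correlation: covariance divided by the product of standard deviations
manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

c(manual = manual, builtin = cor(x, y))   # the two values agree
```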
Before modeling the data, let’s do a principal component analysis (PCA) for visualization and to understand the correlations within the dataset.
# using prcomp function for PCA
pc <- prcomp(data_new,
center = TRUE, # Centers means to 0 (optional)
scale = TRUE) # Sets unit variance (helpful)
# Get summary stats
summary(pc)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 1.2636 1.1461 1.0898 1.0803 0.9697 0.87475 0.75293 0.68023
Proportion of Variance 0.1996 0.1642 0.1484 0.1459 0.1176 0.09565 0.07086 0.05784
Cumulative Proportion 0.1996 0.3638 0.5122 0.6581 0.7756 0.87130 0.94216 1.00000
As you can see, the variance is spread fairly evenly across the components, so the data are not strongly correlated.
#Screeplot for number of components
plot(pc)
# Get standard deviations and rotation
pc
Standard deviations (1, .., p=8):
[1] 1.2636174 1.1460754 1.0897825 1.0802661 0.9697365 0.8747479 0.7529284 0.6802261
Rotation (n x k) = (8 x 8):
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Store -0.58302714 0.06282082 0.1876872 0.1645521 -0.23624741 -0.32652943 0.65674624 -0.008894007
Weekly_Sales 0.37061204 -0.24412301 -0.5807988 -0.1815134 0.21885436 -0.05564072 0.61394405 -0.069684476
Holiday_Flag 0.04756473 -0.31022505 -0.1891669 0.5898542 -0.45713176 0.51900724 0.07426895 0.184206619
Temperature 0.01940246 0.75361202 -0.1550988 0.0101879 0.13179102 0.28400046 0.18263804 0.525500962
Fuel_Price -0.16554419 0.21550230 -0.3006047 -0.5389580 -0.61367045 0.23289369 -0.05610621 -0.333670674
CPI 0.47216154 0.30498145 0.4688994 0.1664354 -0.04469587 0.22873143 0.30464448 -0.537920392
Unemployment -0.51150062 0.02703337 -0.2094692 0.1523670 0.52771281 0.44046373 -0.03216957 -0.443868529
date_number 0.09007179 0.36345716 -0.4620599 0.5005512 -0.11349736 -0.48958352 -0.23641808 -0.295411396
# See how cases load on PCs
pre <- predict(pc) %>% round(2)
dim(pre)
[1] 6435 8
#plotting the first 2 components
plot(pre[,1], pre[,2], xlab = "Component 1", ylab = "Component 2")
# Biplot of first two components
biplot(pc)
As you can see, the first 2 principal components explain only 36% of the variance, and they show no linear correlation either. In the biplot, the length of a vector denotes how much that feature contributes to the components, and the cosine of the angle between two vectors is proportional to the correlation between them. As you can see, the weekly sales and holiday flag are correlated.
Now let’s do the multivariate modeling of the data. For that let’s define the x and y data.
# Let's shuffle the dataset before splitting
data_shuff <- data1[sample(1:nrow(data1)),]
# define x and y values
x = data_shuff[c(-2, -3)]
x <- as.matrix(x)
y <- data_shuff$Weekly_Sales
# let's split the data into training, validation and test datasets with a 70:20:10 ratio
# In total the dataset has 6435 rows
xtrain <- x[1:4504,]
ytrain <- y[1:4504]
xval <- x[4505:5792, ]
yval <- y[4505:5792]
xtest <- x[5793:6435,]
ytest <- y[5793:6435]
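Rather than hard-coding the row indices, the same 70:20:10 split boundaries can be derived from the proportions. A small sketch (the boundaries differ from the hand-picked values above by one row because of rounding):

```r
n <- 6435                                # total rows in the dataset
n_train <- floor(0.7 * n)                # 4504 training rows
n_val   <- floor(0.2 * n)                # 1287 validation rows
idx_train <- 1:n_train
idx_val   <- (n_train + 1):(n_train + n_val)
idx_test  <- (n_train + n_val + 1):n     # remainder goes to the test set
c(train = length(idx_train), val = length(idx_val), test = length(idx_test))
```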
# Now let's fit a linear model on the training dataset
reg <- lm(ytrain ~ xtrain)
reg # print the coefficients only
Call:
lm(formula = ytrain ~ xtrain)
Coefficients:
(Intercept) xtrainStore xtrainHoliday_Flag xtrainTemperature xtrainFuel_Price
1863945.7 -15307.6 29616.5 -1653.2 13577.8
xtrainCPI xtrainUnemployment xtraindate_number
-2207.5 -17256.6 489.7
summary(reg) # Inferential tests
Call:
lm(formula = ytrain ~ xtrain)
Residuals:
Min 1Q Median 3Q Max
-1117293 -382286 -50424 368738 2581225
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1886018.94 92106.54 20.476 < 2e-16 ***
xtrainStore -15607.93 623.69 -25.025 < 2e-16 ***
xtrainHoliday_Flag 46375.55 31339.79 1.480 0.139
xtrainTemperature -2050.87 462.83 -4.431 9.59e-06 ***
xtrainFuel_Price 11078.03 17664.69 0.627 0.531
xtrainCPI -2070.47 221.04 -9.367 < 2e-16 ***
xtrainUnemployment -19246.32 4476.82 -4.299 1.75e-05 ***
xtraindate_number 510.84 83.88 6.090 1.22e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 521400 on 4496 degrees of freedom
Multiple R-squared: 0.1494, Adjusted R-squared: 0.1481
F-statistic: 112.8 on 7 and 4496 DF, p-value: < 2.2e-16
Let’s look at the actual weekly sales and the predicted weekly sales from the training data.
pred_ytrain <- predict(reg, newdata = as.data.frame(xtrain))
for (i in seq(1:30)){
str <- sprintf("Actual : %f, predicted :%f \n", ytrain[i], pred_ytrain[i])
cat(str)
}
Actual : 322868.560000, predicted :738877.347774
Actual : 474917.980000, predicted :687083.684192
Actual : 558464.800000, predicted :1255890.453871
Actual : 1390174.630000, predicted :791220.679379
Actual : 1621109.300000, predicted :1142539.841583
Actual : 527041.460000, predicted :723911.795652
Actual : 2057406.330000, predicted :1000791.637183
Actual : 1456221.100000, predicted :1186111.622251
Actual : 1427383.240000, predicted :856593.283421
Actual : 510382.500000, predicted :683563.416374
Actual : 951244.660000, predicted :1236871.579658
Actual : 1916812.740000, predicted :1341968.926037
Actual : 1867345.090000, predicted :1488827.169563
Actual : 1566668.910000, predicted :1058992.276472
Actual : 337743.260000, predicted :795956.849997
Actual : 994801.400000, predicted :1238561.652670
Actual : 527117.810000, predicted :820051.230478
Actual : 547384.900000, predicted :1242107.343553
Actual : 1605491.780000, predicted :1258639.914511
Actual : 478503.060000, predicted :666424.099249
Actual : 1135577.620000, predicted :1151776.775310
Actual : 981646.460000, predicted :1081826.130229
Actual : 791356.900000, predicted :931743.818543
Actual : 774262.280000, predicted :1317583.147945
Actual : 1554806.680000, predicted :1225866.414391
Actual : 301893.630000, predicted :764234.623301
Actual : 940299.870000, predicted :880510.031155
Actual : 293953.080000, predicted :813546.634589
Actual : 1897429.360000, predicted :1368371.561523
Actual : 403217.220000, predicted :1146324.330808
The R-squared statistic should be close to 1, and in our case we are getting only about 0.15. The residual standard error is also really high. Maybe our simple multivariate regression model is not good enough for prediction here, which is also evident from the first 30 actual vs. predicted weekly sales. Feature engineering could have helped if some of the features exhibited a non-linear relationship with the weekly sales.
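Before moving on to more flexible models, it is worth having held-out evaluation metrics in place. A minimal sketch with hypothetical toy numbers (the same functions would apply to the validation and test sets, e.g. xval/yval and xtest/ytest):

```r
# RMSE: root mean squared error of the predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# R-squared: 1 minus residual sum of squares over total sum of squares
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

actual    <- c(3, 5, 7, 9)          # made-up values for illustration
predicted <- c(2.8, 5.3, 6.9, 9.2)
rmse(actual, predicted)
r_squared(actual, predicted)
```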
We will revisit this problem with a polynomial regression and decision tree regression in the future.
# How to clear packages
#p_unload(dplyr, tidyr, stringr) # Clear specific packages
p_unload(all) # Easier: clears all add-ons
The following packages have been unloaded:
corrplot, tidyr, stringr, shiny, rmarkdown, rio, plotly, lubridate, httr, ggvis, ggthemes, GGally, ggplot2, dplyr, pacman
#detach("package:datasets", unload = TRUE) # For base packages
# Clear console
#cat("\014") # ctrl+L